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Abstract 



We consider probabilistic multinomial probit classification using Gaussian process (GP) 
priors. The challenges with the multiclass GP classification are the integration over the 
non-Gaussian posterior distribution, and the increase of the number of unknown latent 
variables as the number of target classes grows. Expectation propagation (EP) has proven 
to be a very accurate method for approximate inference but the existing EP approaches for 
the multinomial probit GP classification rely on numerical quadratures or independence 
assumptions between the latent values from different classes to facilitate the computations. 
In this paper, we propose a novel nested EP approach which does not require numerical 
quadratures, and approximates accurately all between-class posterior dependencies of the 
latent values, but still scales linearly in the number of classes. The predictive accuracy of the 
nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte 
Carlo (MCMC) approximations with various benchmark data sets. In the experiments 
nested EP was the most consistent method with respect to MCMC sampling, but the 
differences between the compared methods were small if only the classification accuracy is 
concerned. 

Keywords: Gaussian process, multiclass classification, multinomial probit, approximate 
inference, expectation propagation 

1. Introduction 

Gaussian process (GP) priors enable flexible model specification for Bayesian classification. 
In multiclass GP classification, the posterior inference is challenging because each target 
class increases the number of unknown latent variables by n (the number of observations). 
Typically, independent GP priors are set for the latent values for each class and this is 
assumed throughout this paper. Since all latent values depend on each other through the 
likelihood, they become a posteriori dependent, which can rapidly lead to computationally 
unfavorable scaling as the number of classes c grows. A cubic scaling in c is prohibitive, 
and from a practical point of view, a desired complexity is 0{cn^) which is typical for the 
most existing approaches for multiclass GP. The cubic scaling with respect to the number 
of data points is standard for full GP priors, and to reduce this complexity, sparse 
approximations can be used, but these are not considered in this paper. As an additional 
challenge, the posterior inference is analytically intractable because the likelihood term 
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related to each observation is non-Gaussian and depends on multiple latent values (one for 
each class). 

A Markov chain Monte Carlo (M CMC) appro ach for multiclass GP classification with 
a softmax likelihood was described by Neall ( 19981 ). Sampling of the latent values with the 
softmax model is challenging because the dimensionality is often high and standard methods 
such as the Metropolis-Has tings and hybrid Mon t e Car lo algorithms require tuning of the 
step size parameters. Later iGirolami and RogersI ()2006l ) proposed an alternative approach 
based on the multinomial probit likelihood which can be augmented with auxiliary latent 
variables. This enables a convenient Gibbs sampling framework in which the latent values 
are conditionally independent between classes and normally distributed. If the hyperparam- 
eters are sampled, one MCMC iteration scales as 0{cn^) which can become prohibitively 
expensive for large n since typically thousands of posterior draws are required, and strong 
dependency between hyperpa rameters and latent values c auses slow mixing of the chains. 

To speed up the inference, Williams and Barber ( 19981 ) used the Laplace approximation 
(LA) to approximate the non-Gaussian posterior distribution of the latent function values 
with a tractable Gaussian distribution. Conveniently the LA approximation with the soft- 
max likelihood leads to an efficient representation of the approximative posterior scaling 
as 0((c + l)n^), which facilitates considerably the predictions and gradient-based type-II 
maximum a posterior (MAP ) estimation of the covariance function hyperparameters. Later 
Girolami and Roger i (l2006h proposed a factorized variational Bayes approximation (VB) 
for the augmented multinomial probit model. Assuming the latent values and the auxil- 
iary variables a posteriori independent, a computationally efficient posterior approximation 
scheme is obtained. If the latent processes related to each class share the same fixed hy- 
perparameters, VB requires only one 0{n^) matrix inversion per iteration step compared 
to c + 1 such inversions per iteration required for LA. 

Expectation propagation (EP) is the method of choice in binary GP classification where 
it ha s been found very accurate with reasonable computational cost (jKuss and Rasmussen . 
2005 : Nickisch and Rasmussen . 20081 ). Two types of EP approximations have been consid- 
ered for the multiclass setting; the first assuming the latent values from different classes a 
posteri ori independent (^lEP) aiid the second assuming them fullv correlated f Seeger and 



Jordan, l2004l : ISeeger et al.l . l2006l : IGirolami and Zhongl . 120071 ) . Incorporating the full poste- 
rior couplings requires evaluating the non -analytical moments of c-dimensional tilted distri- 
butions which IGirolami and Zhongi (j2007l ) approximated with Laplace's method resu l ting i n 
an app roximation scheme known as Laplace propagation described by ISmola et al.l (j2004l ) . 
Earlier ISeeger and JordanI (j2004l ) proposed an alternative approach where the full posterior 
dependencies were approximated by enforcing a similar structure for the posterior covari- 
ance as in LA using the softmax likelihood. This enables a posterior representation scaling 
as 0{{c+l)n^) but the proposed implementation requires a c-dimensional numerical quadra- 
ture and double-loop opt imization to obtain a restr icted- form site covariance approximation 
for each likelihood term ( Seeger and Jordan . 20041 ) To reduce th e computational dem and 
of EP, factorized pos t erior approximations were proposed by both ISeeger et al.l (j2006l ) and 
Girolami and Zhongl (j2007l ). Both approaches omit the between-class posterior dependen- 
cies of the latent values which results into a posterior representation scaling as 0{cn'^). The 



1. ISeeeer and JordanI (|2004l ) achieve also a linear scaling in the number of training points but we omit 
sparse approaches here. 
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approaches rely on numerical two-dimensional quadratures for evaluating the moments of 
the tilted distributions with the main difference being that ISeeger et al. I (|2006h used fewer 
two-dimensional quadratures for computational speed-up. 

A different EP approach for the multiclass setting was described bv Kim and Ghahra- 
li ( 20061 ) who adopted the threshold function as an observation model. Each threshold 
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likelihood term factorizes into c — 1 terms dependent on only two latent values. This prop- 
erty can be used to transform the inference on to an equivalent non-redundant model which 
includes n(c — 1) unknown latent values with a Gaussian prior and a likelihood consisting 
of n(c — 1) in dependent terms. It follows th at standard EP methodology for binary GP 
classification ( Rasmussen and Williams . 20061 ) can be applied for posterior inference but a 
straightforward implementation results in a posterior re presentation scaling as 0((c— l)^n^) 
and means to improve the scali ng are not discussed by Kim and Ghahramani ( 20061 ). Con- 
trary to the usual EP approach, Kim and Ghahramani (j2006l ) determined the hyperparam- 
eters by maximizing a lower bound on the log marginal likeliho od in a similar way as is 



done in the expectation maximization (EM) algorithm. Recently iHernandez-Lobato et al 



(|201ll ) introduced a robust generalization for the multiclass GP classifier with a threshold 
likelihood by incorporating n additional binary indicator variables for modeling possible 
labeling errors. Efficiently scaling EP inference is obtained by making the lEP assumption. 

In this paper, we focus on the multinomial probit model and describe first an effi- 
cient quadrature-free nested EP approach for multiclass GP classification that scales as 
0{{c + l)n'^). The proposed EP method takes into account all the posterior covariances 
between the latent variables, and the posterior computations are as efficient as in the LA 
approximation. Using the nested EP algorithm, we assess the utility of the full EP approx- 
imation with respect to lEP by comparing the convergence of the algorithms, the qualities 
of the conditional predictive distributions given the hyperparameters, and the suitability 
of the respective marginal likelihood approximations for type-II MAP estimation of the 
covariance function hyperparameters. Finally the predictive performance of the proposed 
full EP approach is compared with lEP, LA, VB, and MCMC using several real-world data 
sets. Since LA is known to be fast, we also test whether the predictive probability estimates 
of LA can be further improved using the Laplace's method as described bv Tiernev and 
Kadane (jlOsi)! 



2. Gaussian processes for multiclass Classification 

We consider a classification problem consisting of d-dimensional input vectors Xj associated 
with target classes yi € {!,..., c}, where c > 2, for i = l,...,n. All class labels are 
collected in the n x 1 target vector y, and all covariate vectors are collected in the matrix 
X = [xi, . . . ,x„]^ of size n X d. Given the latent function values fj = [fl, f^, . . . , /f]"^ = 
f(xj) at the observed input locations Xj, the observations yi are assumed independently and 
identically distributed as defined by the observation model p{yi\ii). The latent vectors from 
all observations are denoted by f = [/j^, . . . , /f , . . . , Z^, . . . , /f , . . . , /^] . 

Our goal is to predict the class membership for a new input vector x* given the observed 
data D = {X, y}, which is why we need to make some assumptions on the unknown function 
/(x). We set a priori independent zero-mean Gaussian process priors on the latent values 
related to each class, which is the usual assumption in multiclass GP classification (see, e.g. 
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Williams and Barber 



(Il998h : ISeeger and JordanI (|2004h : iRasmussen and Williaml (|2006h 



Girolami and Zhona (j2007l )). This specification results in the following zero- mean Gaussian 
prior for f: 

p{{\X)=M{f\0,K), (1) 

where K is a cn x cn blocked diagonal covariance matrix with matrices , , • • • , K'^ 
(each of size n x n) on its diagonal. Element K'y- of the /c'th covariance matrix defines the 
prior covariance between the function values ff and /j^ , which is governed by the covariance 



function k{'x.i,Ji.j), that is, K^j = /t(xj,Xj) = Gov 



ft J 



within the class k. A common 



choice for the covariance function is the squared exponential 

d 



3(Xi 



x,- 



exp 



X 



'j,k 



(2) 



k=l 



where 9 = {cj^, /i, . . . , Id} collects the hyperparameters governing the smoothness properties 
of latent functions. Magnitude parameter cj^ controls the overall variance of the unknown 
function values, and the length-scale parameters li,...,ld control the smoothness of the 
latent function by defining how fast the correlation decreases in each input dimension. The 
framework allows separate covariance functions or hyperparameters for different classes but 
throughout this work, for simplicity we use the squared exponential covariance function 
with the same 9 for all classes. 

In this paper, we consider two different observation models: the softmax 

exp(/f) 



Ei=i exp(/i 



and the multinomial probit 



piVilfi) = E 



n ^("^+/f 




(3) 



(4) 



where ^> denotes the cumulative density function of the standard normal distribution, and 
the auxiliary variable Ui is distributed as p{ui) = A/'(0, 1). The softmax and multinomial 
probit are multiclass generalizations of the logistic and the probit model respectively. 

By applying Bayes' theorem, the conditional posterior distribution of the latent values 
is given by 



1 



p{i\V,9) = -p{i\X,9)llp{y,\i, 



(5) 



where Z = p{y\X,9) = J p({\X,9)Y\^^^p{yi\ii)df is known as the marginal likelihood. 
Both observation models result in an analytically intractable posterior distribution and 
therefore approximative methods are needed for integration over the latent variables. Dif- 
ferent approximate methods are more suitable for particular likelihood function due to the 
convenience of implementation: the softmax is prefer able for LA because o f the e fficient 
structure and computability of the partial derivatives ( Williams and Barber . 19981 ). while 
the probit is preferable f or VB, EP and Gibbs s ampling because of the convenier it auxiliary 
variable representations (jGirolami and Rogersl . I2OO6I : iGirolami and ZhoneJ . 120071 ) . 
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3. Approximate inference using expectation propagation 

In this section, we first give a general description of EP for multiclass GP classification 
and review some existing approaches. Then we present a novel nested EP approach for the 
multinomial probit model. 

3.1 Expectation propagation for multiclass GP 

Expectation propagation is an iterative algori thm for approximating integrals over functions 
that factor into simple terms ( Minka . 2001bl ). Using EP the posterior distribution ([5]) can 



be approximated with 

1 " 

qEp{n^,0) = -—p{i\x,e)T\k(.U\z^,p,i,t,), (6) 

where ij(f /ij, Sj) = ZjA/'(f Sj) are local likelihood term approximations parame- 
terized with scalar normalization terms Zi, c x 1 site location vectors /ij, and c x c site 
covariance terms Sj. In the algorithm, first the site approximations are initialized, and 
then each site is updated in turns. The update for the i'th site is done by first leaving out 
the site from the marginal posterior which gives the cavity distribution 

q.,{U) = ^^{^^\^^~^,^-^) <x q{u\v,9)i{ur\ (7) 

The cavity distribution is then combined with the exact i'th likelihood term p{yi\ii) to form 
the non-Gaussian tilted distribution 

p{U) = Zr\_,{U)p{y,\U), (8) 

which is assumed to encompass more information about the true marginal distribution. 
Next a Gaussian approximation q{fi) is determined for p{ii) by minimizing the Kullback- 
Leibler (KL) divergence KL(p(f j)||g(fj)), which is equivalent to matching the first and 
second moments of q{ii) with the corresponding moments of p(fj). Finally, the parameters 
of the i'th site are updated so that the mean and covariance of q{fi) are consistent with q{ii)- 
After updating the site parameters, the posterior distribution ([6]) is updated. This can be 
done either in a sequential way, where immediately after each site update the posterior is 
refreshed using a rank-c update, or in a parallel way, where the posterior is refreshed only 
after all the site approximations have been updated once. This procedure is repeated until 
convergence, that is, until all marginal distributions q{fi) are consistent withp(fj). 

In binary GP classification, determining the moments of the tilted distribution requires 
solving only one-dimensional integrals, and assuming the probit likelihood function, these 
univariate integrals can be computed efficiently without quadrature. In the multiclass set- 
ting, the probl em is how to evaluate themulti-dimensional integrals to determine the mo- 



ments of ([8|). iGirolami and Zhongl (j2007l ) approximated the moments of ([8]) using the 



Lapl ace approximation which results in an algorithm called Laplace propagation f Smola 



et al., 120041 ). The problem with the LA approach is that the mean is replaced with the 



mode of the distribution and the covariance with the inverse Hessian of the log density at 
the mode. Because of the skewness of the tilted distribution caused by the likelihood func- 
tion, the LA method can lead to inaccurate mean and covariance estimates in which case the 
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resulti ng posterior approximation does not correspond to the full EP solution. Seeger and 



Jordan (|200^ estimated the tilted moments using multi-dimensional quadratures, but this 
becomes computationally demanding when c increases, and to achieve a posterior represen- 
tation scaling linearly in c, they do an additional optimization step to obtain a constrained 
site precision matrix for each likelihood term approximation. In our approach, we do not 
need numerical quadratures and we get a similar efficient representation for the site precision 
simultaneously. 

Computations can be facilitated by using the lEP approximation where explicit between- 
class posterior dependencies are omitted. The posterior update with lEP scales linearly in 
c, but the existing approaches for the multinomial pr obit require multiple numeri cal quadra- 
tures for each site update. The implementation of lGirolami and Zhongj ()2007l ) requires a 
total of 2c + 1 tw'o -dimensional numerical quadratures for each likelihood term, whereas 
Seeger et al.l t[)0(h described an alternative approach where only two two-dimensional and 
2c — 1 one-dimensional quadratures are needed. In the proposed nested EP approach a 
quadrature-free lEP approximation can be formed with similar computational complexity 
as the full EP approximation. Compared to the full EP approximation, lEP underestimates 
the uncertainty on the latent values and in practice it can converge slower than full EP es- 
pecially if the hyperparameter setting results into strong between-class posterior couplings 
as will be demonstrated later. 



3.2 Efficiently scaling quadrature-free implementation 

In this section, we present a novel nested EP approach for multinomial probit classification 
that does not require numerical quadratures or sampling for estimation of the tilted moments 
or predictive probabilities. The method also leads naturally to low-rank site approximations 
which retain all posterior couplings but results in linear computational scaling with respect 
to the number of target classes c. 

3.2.1 Quadrature-free nested expectation propagation 

Here we use the multinomial probit as the likelihood function because its product form 
consisting of cumulative Gaussian factors is computationally more suitable for EP than 
the sum of exponential terms in the softmax likelihood. Given the mean fx-i and the 
covariance S„j of the cavity distribution, we need to determine the normalization factor Zi, 
mean vector /ij, and covariance matrix Sj of the tilted distribution 



n ^(^^+/f 



(9) 



which requires solving non-analytical (c+ l)-dimensional integrals ove r f,; and Uj. Instead 



of qu adrature methods ([Girolami and Zhonei . 120071 : ISeeger and Jordanl . l2004l : ISeeger et al 



200fil ). we use EP to approximate these integrals. At first, this approach may seem com- 
putationally very demanding since individual EP approximations are required for each of 
the n sites. However, it turns out that these inner EP approximations can be updated in- 
crementally between the outer EP loops. This scheme also leads naturally to an efficiently 
scaling representation for the site precisions 
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To form a computationally efficient EP algorithm for approximating the tilted moments, 
it is helpful to consider the joint distribution of f j and the auxiliary variable Ui arising from 



Defining Wj = [fj 



Ui 



and removing the marginalization over Ui results in the following 



augmented tilted distribution: 



p{wi) = Z,- A/'(wi[/i„^,S^ 



n '^•(-' 



(10) 



where fi^fi = [M^i; 0]^ and T,^^ is a block-diagonal matrix formed from S_j and 1. Denoting 
the j'th unit vector of the c-dimensional standard basis by ej, the auxiliary vectors hij can 
be written as b,- 



1]^ 



The normalization term Zi is the same for p(fj 
and p(wj), and it is defined by Zj = f 7V(wjj/^„^, S„J J^^y^^^ <I>(wf bjj)(iwj. The other 

quantities of interest, fii and Sj, are equal to the marginal mean and covariance of the first 
c components of Wj with respect to p(wi). 

The augmented distribution (llOp is of similar functional form as the posterior distri- 
bution resulting from a linear binary classifier with a multivariate Gaussian prior on the 
weights Wj and a probit likelihood function. Therefore, the mome nts of (flUl) can be ap- 
proximated with EP similarly as in linear classification (see, e.g., iQi et al.l . |2004| ) or by 
applvi ng the general EP formulation for latent Gaussian models described bv Cseke and 



Heskes (j201ll . appendix C). For clarity, we have summarized a computationally efficient 
implementation of the algorithm in Appendix |Al The augmented tilted distribution pop is 
approximated by 



g(wi) = Z^^W(wi|/i„,,E„J Yl %J-^(wfbij|a^J/3ij,aJ)(iw,, 



(11) 



where the cumulative Gaussian functions are approximated with scaled Gaussian site func- 
tions and the normalization constant Zi is approximated with Z^.. From now on the site 
parameters of (/(wj) in their natural exponential form are denoted by = and 

3.2.2 Efficiently scaling representation 



In this section we show that approximation (fTTI) leads to matrix computations scaling as 
0((c+ l)n^) in the evaluation of the moments of the approximate posterior ([6]). The idea 
is to show that the site precision matrix resulting from the EP update step with Sj 
derived from (jlip has a, similar structure wi t h the Hessian matrix of logp( ?/,;|fj) in the 
Laplace appr oximation ( Williams and Barber ^ 19981: Seeeer and Jordan ^ 2004 ^Rasmussen 
and WilliamsTEooi)! 

The approximate marginal covariance of fj derived from (fTTI) is given by 



5^w,^ + BifiBj 



-1 



(12) 



where Tj = diag(Q;i), Bi = [bjj]j^y^, and = \^ Ic ] picks up the desired components 
of Wj, that is, fj = H^Wi. Using the matrix inversion lemma and denoting Bi = Bi = 
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Gy^l^ — E^y., where E-y. = [ej]jyLy. and lisac— 1x1 vector of ones, we can write the 
tilted covariance as 

= iJ:-_l + B,{fr^ + 11^)-' Bf)-\ (13) 

Because in the moment matching step of the EP algorithm the site precision matrix is 
updated as T,^^ = f^^^ — '^Zj, we can write 

= B,{fr^ + ll^r'Bf = Bi{f, - a,(l + 1^ cyi)-^al)Bj . (14) 

Since Bj is a c x c — 1 matrix, we see that is of rank c — 1 and therefore a straightforward 
implementation based on (jl4p would result into 0((c — l)^n^) scaling in the posterior up- 
date. However, a more efficient representation can be obtained by simplifying (jl4p further. 
Writing Bi = —AiE^y-, where Ai = [Ic — Gy-lJ] and Ic is a c x 1 vector of ones, we get 

Sri = A, [E^y^ETy^ - 7ri(1^7r,)-Vf ) A[ , (15) 

where we have defined tTj = E^y-dn + e^- and used Bidn = —AiTVi. Since Aiey- = we can 
add Gy-el. to the first term inside the brackets to obtain 

t-^ = AiUiAj = Ui, where = diag(7ri) - (lj7ri)-Vi7rf . (16) 

The second equality can be explained as follows. Matrix 11 j is of similar form with the 
precision c ontribution of the i'th like lihood term, Wi = — Vj , logp(yj|fi), in the Laplace 
algorithm ( Williams and Barbeil . 19981 ). and it has one eigenvector, Ic, with zero eigenvalue: 
Iljlc = 0. It follows that AiUi = {Ic- ey^l'^)Ui = Ui-ey.O'^ = Ilj and therefore S^"^ = Ilj. 
Mat rix Hj is also precisely of the same form as the a priori constrained site precision block 
that ISeeger and JordanI (j2004l ) determined by double- loop optimization of KL(g(fj)||(7(fj)). 

In a similar fashion, we can determine a simple formula for the natural location param- 
eter Ui = Yj~^fii as a function of on and ^j. The marginal mean of fj with respect to (/(wj) 
is given by 

fi, = Hf [t.^.] + B^Bj) + BS^) , (17) 

which we can write using the matrix inversion lemma as 

(i^ = tiT.z\^l-^ + ^-iBi{f-^ + 11^ + Bj^_,Biy^fr^pi. (18) 

Using the update formula Vi = S^^/Xj — T^zlfJ'-i resulting from the EP moment matching 
step and simplifying further with the matrix inversion lemma, the site location v>i can be 
written as 



i>i = Bi - aiaij = a^TTi - E_y-f3i, (19) 

where ai = (l^/3i)/(1^7rj). The site precision vector Ui is orthogonal with Ic, that is, 
lji>i = 0, which is congruent with ([16]), Note that with results (fT6|) and (fT9|) . the mean 



8 



Nested EP for GP classification with Multinomial Probit 



and covariance of the approximate posterior ([6]) can be evaluated using only Qj and f3i. 
It follows that the posterior (predictive) means and covariances as well as the marginal 
likelihood cai i be evaluated with sim ilar com putational complexity as with the Laplace ap- 
proximation (IWilliams and Barbe 3, [l998 ; Ra smussen and Williamsl . boOfil ) . For clarity the 



main components are summarized in Appendix B. The lEP approximation in our imple- 
mentation is formed by matching the i'th marginal covariance with diag(diag(Sj)), and the 
corresponding mean with /ij. 

3.2.3 Efficient Implementation 

Approximating the tilted moments using inner EP for each site may appear too slow for 
larger problems because typically several iterations are required to achieve convergence. 
However, the number of inner-loop iterations can be reduced by storing the site parameters 
Qj and Pi after each inner EP run and continuing from the previous values in the next 
run. This framework where the inner site parameters cti and /3j are updated iteratively 
instead of /ij and Sj, can be justified by writing the posterior approximation ([6]) using the 
approximative site terms from (jlip : 



n „ c 

q{f\V, 9) oc p{i\X, e)l[J J\f{u,\0, 1) J] 4jAr(n, + /f - //lar/Aj, a-/)dn,. (20) 

Calculating the Gaussian integral over Ui leads to the same results for /ij and Sj as derived 
earlier (equations 1161 and I19p . Apart from the integration over the auxiliary variables Uj, 
equation (j20p resembles an EP approximation where n(c— 1) probit terms of the form $(tij + 
fi'^ — //) are approximated with Gaussian site functions. In the standard EP framework we 
form the cavity distribution (/_j(f j) by removing c — 1 sites from ()20p and subsequently refine 
Qj and (3i using the mean and covariance of the tilted distribution ([9]) . If we alternatively 
expand the i'th site approximation with respect to n, and write the corresponding marginal 
approximation as 

q{ii\V,9) cc q^iiU) j N{u,\Q, 1) \{ Zg^,jN{u, + /f - // jcir^ft,,, 0,^1)^^,, (21) 

we can update only one of the approximative terms in (f2T]) at a time. This is equivalent to 
starting the inner EP iterations with the values of on and Pi from the previous outer-loop 
iteration instead of a zero initialization which is customary to standard EP implementations. 
In our experiments, only one inner-loop iteration per site was found sufficient for convergence 
with comparable number of outer-loop iterations, which results in significant computational 
savings in the tilted moment evaluations. 

The pre vious interpretation of the algorithm is also useful for defining damping f Minka 
and LaffertyliooJ^ which is commonly used to improve the numerical stability and conver- 



gence of EP. In damping the site parameters in their natural exponential forms are updated 
to a convex combination of the old and new values. Damping cannot be directly applied on 
the site precision matrix 11^ = because the constrained form of the site precision (fT6|) is 
lost. Instead we damp the updates on ctj and Pi which preserves the desired structure. This 
can be justified with the same arguments as in the previous paragraph where we considered 
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updating only one of the approximative terms in ([2T]) at a time. Convergence of the nested 
EP algorithm with full posterior couplings using this scheme is illustrated with different 
damping levels in Section 5.3. 



4. Other approximations for Bayesian inference 

In this section we discuss all the other approximations considered in this paper for multiclass 
GP classification. First, we give a short description of the LA method. Then we show how 
it can be improved upon by computing corrections to the rna.rgina l predictive densities 
using Laplace's method as described by iTiernev and Kadand (|l986l ). Finally, we briefly 
summarize the VB and MCMC approximations. 



4.1 Laplace approximation 

In the Laplace approximation a second order Taylor expansion of logp{{\'D,6) is made 
ar ound the posterior mode f wh ich c an be determined using Newton's method as described 
by Williams and Barber ( 19981 ) and Rasmussen and Williams ( 20061 ) . This results in the 
posterior approximation 



(22) 



where W = — VVlogp(y|f)|f^|. and p(y\i) = YYi=iP(.yi\^i)- With the softmax likelihood 
(l3|), the sub-matrix of W related to each observation will have similar structure with Ilj 
in (fTBl) . which enables efficient posterior computations that scale linearly in c as already 
discussed in the case of EP. 



4.1.1 Improving marginal posterior distributions 

In Gaussian process classification, the LA and EP methods can be used to efficiently form 
multivariate Gaussian approximation f or the posterior distribution of the latent values. Re- 
cently, motivated by the earlier ideas of lTierney and Kadane (Il98fil ). two methods have been 
proposed for improving the marginal posterio r distributions in latent Gaussian models; one 
based on s ubsequent use of Laplace's method ( Rue et all. 2009 ). and one based on EP f Cseke 



and Heskes. l201ll ). Because in classification the focus is not on the predictive distributions 
of the latent values but on the predictive probabilities related to a test input x^,, applying 
these methods would require additional numerical integration over the improved posterior 
approximation of the corresponding latent value = f(x*). In multiclass setting integra- 
tion over a multi-dimensional space is required which becomes computationally demanding 
to perform, e.g., in a grid, if c is large. To avoid this integration, we test computing the cor- 
re ctions directly for the predi ctive class probabilities following another approach presented 
by lTierney and Kadand (119861 ). A related idea for approximating the predictive distribution 
of linear model coefficients d i rectly with a deterministic approximation has been discussed 
bv lSnelson and Ghahramanil (j2005l ). 

The posterior mean of a smooth and positive function g{f) is given by 



mf)] 



1 9{f)p{y\f)pif)df 



(23) 
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where p(y \ f) is the Hkehhood function, and p{f) the prior distribution. iTiernev and Kadane 



(|l986h aroposed to approximate both integrals in ()23p separately with the Laplace's method 



This approach can be readily applied for approximating the posterior predictive probabilities 
p{y*\^*) of class memberships € {1, c} which are given by 

p(y,|x„P) = ^ JJ p(?/*|f*)p(f*|f)p(f>(y|f)dfdf., (24) 

where Z = JJ p(f^,|f)p(f)p(y|f)(if(if* = J p{f)p{y\i)di is the marginal likelihood. With 
fixed class label y* the integrals can be approximated by a straightforward application 
of either LA or EP, which is already done for the marginal likelihood Z in the standard 
approximations. The LA method can be used for smooth and positive functions such as the 
softmax whereas EP is applicable for a wider range of models. 

The integral on the right side of (p^ equivalent to the marginal likelihood resulting 
from a classification problem with one additional training point y*. To compute the pre- 
dictive probabilities for all classes, we evaluate this extended marginal likelihood of n + 1 
observations with fixed to the c possible class labels. This is computationally demanding 
because several marginal likelihood evaluations are required for each test input. Additional 
modifications, for instance, initializing the latent values to their predictive mean implied by 
standard LA, could be done to speed up the computations. Since further approximations 
can only be expected to reduce the accuracy of the predictions, we do not consider them 
in this paper, and focus only on the naive implementation due to its ease of use. Since LA 
is known to be fast, we test the goodness of the improved predictive probability estimates 
using only LA, and refer to the method as LA-TKP as an extension to the naming used by 



Cseke and HeskesI (|201l|). 



4.2 Markov chain Monte Carlo 

Because MCMC estimates become exact in the limit of infinite sample size, we use MCMC 
as a gold standard for measuring the performance of the other approximations. Depending 
on the likelihood, we use two different sampling techniques; scaled Metropolis-Hastings 
sampling for the softmax, and Gibbs sampling for the multinomial probit. 

4.2.1 Scaled Metropolis-Hastings sampling for softmax 

To obtain samples from the full posterior with the softmax likelihood, the following two 
steps are alternated. Given the hyperparameter values, the latent values are drawn from 



the c onditional posterior p{i\X,6,y) using the scaled Metropolis-Hastings sampling (jNeal 



19981 ). Then, the hyperparameters ar e drawn frorn the conditional p osterior p{6\f) using 



the Hamiltonian Monte Carlo (HMC) (jPuane et al.l . Il987l : iNeall . Il996l ) 



4.2.2 Gibbs sampling for Multinomial probit 



Girolami and Rogers described how to draw samples from the joint posterior using 



the Gibbs sampler. The multinomial probit likelihood ^ can be written in the form 

p(y,|fO = J i^ivf > vfyk ^ y^)fl^^{v{\f^, i)dv„ (25) 
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where Vj = [vj , ...,v'^]'^ is a vector of auxiliary variables, and "0 is the indicator function 
whose value is one if the argument is true, and zero otherwise. Gibbs sampling can then 
be employed by drawing samples alternately for all i from p(vj|fj,yj), which are a conic 
truncated multivariate Gaussian distributions, and from p{i\v,9) which is a multivariate 
Gaussian distribution. Given v and f, the hyperparameters can be drawn, for example, 
using HMC. 

4.3 Factorized variational approximation 

A computationally convenient variational Bayesian (VB) approximation for p{{\T>, 9) can 
be formed by employing the auxiliary variable r epres entation (pS]) of the multinomial probit 
likelihood. As shown by Girolami and Rogers! (j2006l ). assuming f a posteriori independent 



of V leads to the following approximation 

n c 

(7Vb(v, f |P, 9) = q{^)q{i) = J] q{^,) J] q{i^), (26) 

i=l k=l 

where the latent values associated with the /s'th class, f'^, are independent. The posterior 
approximation qff^) will be a multiv ariate Gaussian, and (/(vj) a conic truncated multivari- 
ate Gaussian (iGirolami and Roger iliooe). Given the hyperparameters, the parameters of 



(;(v) and q{i) can be determined iteratively by maximizing a variational lower bound on the 
marginal likelihood. Each iteration step requires determining the expectations of Vj with 
respect to g(vj) which can be obtained by either one dimensional numerical quadratures or 
sampling methods. In our implementation, the hyperparameters 9 are determined by maxi- 
mizing the variational lower bound with fixed g(v) and g(f ) similarly as in the maximization 
step of the EM algorithm. 

5. Experiments 

This section is divided into four parts. In Section 5.1 we illustrate the properties of the 
proposed nested EP and lEP approximations in a simple synthetic classification problem. 
In Section 5.2, we compare the quality of the approximate marginal distributions of f , the 
marginal likelihood approximations and the predictive class probabilities between EP, lEP, 
VB and LA in a three class sub-problem extracted from the USPS digit data. In section 
5.3, we discuss the computational complexities of all the previously considered approximate 
methods, and in Section 5.4, we evaluate them in terms of predictive performance with 
estimation of the hyperparameters using several datasets. In section 5.3 we show that nested 
lEP results in almost indistinguishable marginal likelihood and predictive density estimates 
co mpared to th e quadrature-based lEP approximation obtained with the implementation 
of Seeger et al.l (j2006). Therefore, in all other experiments the nested EP approach is used 



to construct both full EP and lEP approximations. 

5.1 Illustrative comparison of EP and lEP with synthetic data 

Figure [1] shows a synthetic three-class classification problem with scalar inputs. The sym- 
bols X (class 1), + (class 2), and o (class 3) indicate the positions of n = 15 training inputs 
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Figure 1 : A synthetic one-dimensional example of a three class classification problem, where 
MCMC, EP and lEP approximations are compared. The symbols x (class 1), + 
(class 2), and o (class 3) in the bottom of plots indicate the positions of n = 15 
observations. Plot (a) shows the predicted class probabilities, and (b) shows the 
predicted latent mean values for all three classes. The symbols Xi and xj indi- 
cate two example positions, where the marginal distributions between the latent 
function values are illustrated in Figures [2] and [3l See the text for explanation. 



generated from three normal distributions with means -1, 2, and 3, and standard deviations 
1, 0.5, and 0.5, respectively. The left-most observations from class 1 can be better sepa- 
rated from the others but the observations from classes 2 and 3 overlap more in the input 
space. We fixed the hyperparameters of the squared exponential covariance function at the 
corresponding MCMC means: log((T^) = 4.62 and log(/) = 0.26. 

Figure HJa) shows the predictive probabilities of all the tree classes estimated with 
EP, lEP and MCMC as a function of the input x. At the class boundaries, the methods 
give similar predictions but elsewhere MCMC is the most confident while lEP seems more 
cautious. The performance of EP is somewhere between MCMC and lEP, although the 
differences are small. To explain why the predictions differ, we look at the qualities of 
the approximations made for the underlying f. Figure [D^b) shows the approximated latent 
mean values which are similar at all input locations. 

To illustrate the approximate posterior uncertainties of f , we visualize two exemplary 
marginal distributions at locations Xi and xj marked in Figure [TJ The MCMC samples of // 
and (the latents associated with classes 1 and 2 related to Xj) together with a smoothed 
density estimate are shown in Figure [2]^a). The marginal distribution is non-Gaussian, 
and the latent values are more likely larger for class 1 than for class 2 indicating a larger 
predictive probability for class 1. The corresponding EP and lEP approximations are shown 
in Figures [2l^b)-(c). EP captures the shape of the true marginal posterior distribution better 
than lEP. To illustrate the effect of these differences on the predictive probabilities, we show 
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10 ,20 30 40 10 ,20 30 40 10 ,20 30 40 

(d) (e) (f) 

Figure 2: An example of a non-Gaussian marginal posterior distribution for the latent values 
at the input Xj in the synthetic example shown in Figure [TJ The first row shows 
the distribution for the latents and . Plot (a) shows a scatter-plot of MCMC 
samples drawn from the posterior and the estimated density contour levels which 
correspond to the areas that include approximately 95%, 90% 75%, 50%, and 
25% of the probability mass. Plots (b) and (c) show the equivalent contour 
levels of EP and lEP approximations (bold black lines) and the contour plot of 
MCMC approximation (gray lines) for comparison. Plots (d)-(f) show contours 
of p(gj|2?, Xi) for gf and gf. The probability for class 1 is obtained by calculating 
the integral over gj, which results in approximately 0.95 for MCMC, 0.88 for EP, 
and 0.82 for lEP. See the text for explanation. 



the unnormalized tilted distributions 

c 

p{gi\V,Xi)=q{g,\V,^,) II <S>{gf), (27) 

k=l,k^yi 

where the random vector gj is formed from the transformed latents gf = wfhij for k ^ yi, 
and g(gj|I',Xj) is the approximate marginal obtained from g(fj|D,Xj) by a linear transfor- 
mation. Note that the marginal predictive probability for class label yi with the multinomial 
probit model ([4]) can be obtained by calculating the integral over gj in ([27|l . Figures[2|^d)-(f) 
show the contours of the different approximations of p{g^\'D,Xi) for k E {2,3}, which for 
MCMC are obtained using a smoothed estimate of q{g^\'D,Xi). The distributions are heav- 
ily truncated by the probit factors elsewhere than the upper-right quadrant. Compared to 
the MCMC estimate, lEP places more probability mass to the other quadrants, and there- 
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MCMC EP lEP 




-2 -2 -2 



10 , 20 30 10 1 20 30 10 < 20 30 

Sj 

(d) (e) (f) 

Figure 3: An example of a close-to-Gaussian but non-isotropic marginal posterior distribu- 
tion for the latent values at the input xj in the synthetic example shown in Figure 
[TJ The first row shows the distribution for the latents fj and Plot (a) shows 
a scatter-plot of MCMC samples drawn from the posterior and the estimated 
density contour levels which correspond to the areas that include approximately 
95%, 90% 75%, 50%, and 25% of the probabihty mass. Plots (b) and (c) show 
the equivalent contour levels of EP and lEP approximations (bold black lines) 
and the contour plot of MCMC approximation (gray lines) for comparison. Plots 
(d)-(f) show contours oi p{gj\V, Xj) for and gj. The probability for class 2 is 
obtained by calculating the integral over g^-, which results in approximately 0.47 
for all the methods. See the text for explanation. 



fore underestimates the predictive probability for class 1 more than EP. The approximate 
predictive probabilities are 0.95 for MCMC, 0.88 for EP, and 0.82 for lEP. 

The location xj is near the class boundary, where all the methods give similar predic- 
tive probabilities, although the latent approximations can differ notably as shown in Figures 
[3l^a)-(c), which visualize the marginal approximations for /? and fj. EP is consistent with 
the MCMC estimate but due to the independence constraint lEP seriously underestimates 
the uncertainty of this close-to-Gaussian but non-isotropic marginal distribution. Although 
Figures [3|^d)-(f) show that lEP is more inaccurate than EP, the integral over the tilted 
distribution of is in practice the same, since equal amount of probability mass is dis- 
tributed on both sides of the diagonal in Figure [3l^c) . The predictive probabilities for class 
2 is approximately 0.47 for all the methods. 
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EP lEP VB LA U-TKP 




(a) (b) (c) (d) (e) 



EP lEP VB LA LA-TKP 




°0 0.5 1 % 0.5 1 % 0.5 1 % 0.5 1 % 0.5 1 

(f) (g) (h) (i) (j) 

Figure 4: Class probabilities on the USPS 3 vs. 5 vs. 7 data. The MCMC estimates are 
shown on x-axis and EP, lEP, VB, LA, and LA-TKP on y-axis. The first row 
shows the predictive probabilities of the true class labels for training and the 
second row for test points. The symbols (x, +, o) corresponds to the handwritten 
digit target classes 3, 5, and 7. The hyperparameters of the squared exponential 
covariance function were fixed at log(cr^) = 4 and log(/) = 2. 



5.2 Approximate marginal densities v^ith the USPS data 

In this section, we compare the predictive performances and marginal likelihood approxi- 
mations of EP, lEP, VB and LA. We define a three class sub-problem from the US Postal 
Service (USPS) repartitioned handwritten digits data by considering classification of 3's 
vs. 5's vs. 7'so The data consists of 1157 training points and 1175 test points with 256 
covariates. We fixed the hyperparameter values at log(fT^) = 4 and log(Z) = 2 which leads 
to skewed non-Gaussian marginal posterior distributions as will be illustrated shortly. 

Figure [4] shows the predictive probabilities of the true class labels for all the approximate 
methods plotted against the MCMC estimate. The first row shows the training and the 
second row the test cases. Overall, EP gives the most accurate estimates while lEP slightly 
underestimates the probabilities for the training cases but performs well for the test cases. 
Both VB and LA underestimate the predictive probabilities for the test cases, but LA-TKP 
with the marginal corrections clearly improves the estimates of the LA approximation. 

Figure [5] shows an example of the latent marginal posterior distributions for one training 
point with the correct class label being 2. For each method, the latent pairs ), ), 

and {fi,fi), are shown. The EP approximation agrees reasonably well with the MCMC 
samples. lEP underestimates the latent uncertainty, especially near the training inputs 
because of the skewing effect of the likelihood. This seems to affect more the predictive 
probabilities of the training points in Figure Hl^b), which can be seen also in Figure [TJ a) 

2. We use the same data partition as discussed by iRasmussen and Williamsl l|2006h . 
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Figure 5: Marginal posterior distributions for one training point with the true class label 
being 2 on the USPS 3 vs. 5 vs. 7 data. Each row corresponds to one of the 
latent pairs [flJl), {fiJf), and {flJf)- The first column shows a scatter-plot 
of MCMC samples drawn from the posterior, and the estimated density contour 
levels which correspond to the areas that include approximately 95%, 90% 75%, 
50%, and 25% of the probability mass. The columns 2-5 show the equivalent 
contour levels of EP, lEP, VB and LA approximations (bold black lines) and the 
contour plot of MCMC approximation (gray lines) for comparison. Note that 
the last column visualizes a difi'erent marginal distribution because LA uses the 
softmax likelihood. The hyperparameters of the squared exponential covariance 
function were fixed at log(cr^) = 4 and log(/) = 2 to obtain a non-Gaussian 
posterior distribution. 



further away from the decision boundary near the input Xi . Figure [5] shows that the VB 
method seriously underestimates the latent uncertainty. The independence assumption of 
VB leads to an isotropic approximate distribution, and although the predictive probabilities 
for the training cases are somewhat consistent with MCMC, the predictions on the test data 
fail (plots (c) and (h) in FigureU]). The LA approximation captures some of the dependencies 
between the latent variables associated with different classes, but the joint mode of f is a 
poor estimate for the true mean, which causes inaccurate predictive probabilities (plots (d) 
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Figure 6: Marginal likelihood approximations and predictive performances as a function of 
the log-lengthscale log(/) and log-mag nitude log((T2) for EP, lEP, VB and LA 
on USPS 3 vs. 5 vs. 7 data. The first row shows the log marginal likelihood 
approximations, the second row the log predictive densities in a test set, and the 
third row the classification accuracies in a test set. 



and (i) in Figure H]). The VB mean estimate is also closer to LA than MCMC, although 
LA uses a different obser vation model. 

Kuss and Rasmussen ( 2005 ) and Nickisch and Rasmussen ( 20081 ) discussed how a large 
value of the magnitude hyperparameter can lead to a skewed posterior distribution in 
binary classification. In the multiclass setting, similar behavior can be seen in the marginal 
distributions as illustrated in Figures [2] and El A large o"^ leads to a more widely distributed 
prior which in turn is truncated more strongly by the likelihood where it disagrees with the 
target class. In the previous comparison, the hyperparameter values were chosen to pro- 
duce non-Gaussian marginal posterior distributions for demonstration purposes. However, 



usually the hyp_e 



and Rasmussen pOOSi ) and 



parameters are estimated bv maxim 



Nickisch and Rasmussen 



zing the marginal likelihood. Kuss 



20081 ) studied the suitability of the 



marginal likelihood approximations for selecting hyperparameters in binary classification. 
They compared the calibration of predictive performance and the marginal likelihood esti- 
mates on a grid of hyperparameter values. In the following, we extend these comparisons 
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to multiple classes with the US PS dataset, for which similar considerations were done by 
Rasmussen and Williams ( 20061 ) with the LA method. 



The upper row of Figure [6] shows the log marginal likelihood approximations for EP, 
lEP, and LA, and the lower bound on evidence for VB as a function of the log-lengthscale 
log(Z) and log-magnitude log(cT^) using the USPS 3 vs. 5 vs. 7 data. The middle row 
shows the log predictive densities evaluated on the test set, and the bottom row shows 
the corresponding classification accuracies. The marginal likelihood approximations and 
predictive densities for EP and lEP appear to be similar, but the maximum contour of the 
log marginal likelihood for lEP (the contour labeled with -166 in plot (b)) does not coincide 
with the maximum contour of the predictive density (the contour labeled with -76.5 in plot 
(f)), which is why a small bias can occur if the approximate marginal likelihood is used 
for selecting hyperparameters. With EP there is a good agreement between the maximum 
values in plots (a) and (e), and overall, the log predictive densities are higher than with the 
other approximations. The log predictive densities of VB and LA are small where log((T^) is 
large (regions where qiflT), 9) is likely to be non-Gaussian), but also the marginal likelihood 
approximations favor the areas of smaller log((T^) values. 

There is a reasonable agreement with the marginal likelihood approximations and clas- 
sification accuracies with EP and lEP, although the maximum accuracies are slightly lower 
than with VB and LA. The maximum accuracies are very high with VB, but the region of 
the highest accuracy does not agree with the region of the highest estimate of the marginal 
likelihood. With LA the marginal likelihood estimate is calibrated better with the classi- 
fication accuracy, but the performance worsens when the posterior distribution is skewed 
with large values of log((T'^). 

5.3 Computational complexity and convergence 

In this section we consider the computational complexities of the approximate methods for 
one iteration with fixed hyperparameter values. Note that the following discussion is only 
approximate, and the practical efficiency of the algorithms depends much on implementa- 
tions and the choices of convergence criteria. 

Table [T] summarizes the approximate scaling of the number of computations as a func- 
tion of n and c. EP and lEP refer to the fully coupled and class-independent approxi- 
mations, respectively, determined with the proposed nested EP algo rithm. QIEP refer s 



to the quadrature-based class-independent approximation proposed bv lSeeger et al.l (|2006l ) 
and MCMC refers to Gibbs sampling with the multinomial probit model. The first row 
(Posterior) describes the overall scaling of the mean and covariance calculations related to 
the approximate conditional posterior of f . The base computational cost resulting from the 
full GP prior scales as C(n^) due to the n x n matrix inversion (in practice computed using 
Cholesky decomposition), which is required c times for lEP, QIEP, and MCMC, and one 
additional time for EP and LA due to incorporation of the between-class correlations. If 
the same prior covariance structure is used for all classes, VB has the lowest cost, because 
only one matrix inversion is required per iteration. 

The second row (Likelihood) approximates the scaling of the number of calculations 
that are required besides the posterior mean and covariance evaluations (mainly likelihood 
related computations for one iteration). For both EP and lEP, this row describes the scaling 
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Table 1: Approximate computational complexities of the various methods as a function of 
n and c for one iteration with fixed hyperparameters. The first row summarizes 
the scaling of the mean and covariance calculations related to the approximate 
conditional posterior of f . The second row approximates the scaling of the number 
of calculations required for additional likelihood related computations. Parameter 
nin refers to the number of inner EP iterations in the nested EP, and to the 
cost of one- and two-dimensional numerical quadratures respectively, and Us to 
the cost of sampling from a conic truncation of a c-variate Gaussian distribution. 





EP 


lEP 


QIEP 


VB 


LA 


MCMC 


Posterior 
Likelihood 


nnin{c — 1)'^ 


nnin(c — 1)^ 


n((2c — l)nq 
+2n2) 


n{c — l)2nq 


{c+ l)n^ 
nc 


cn' 
ncns 



of the computations needed for the tilted moment approximations done with inner EP 
algorithm. For QIEP the second row summarizes the number of one- and two-dimensional 
numerical quadratures (denoted by riq and n'^ respectively) required for the tilted moment 
evaluations under the independence assumption, and for LA the number of calculations 
required for evaluating the first and second order derivatives of the softmax likelihood. 
Each VB iteration requires evaluating the expectations of the auxiliary variables either by a 
quadrature or sampling, and the cost of one such operation is denoted by riq (for example, 
the number of quadrature design points). Gibbs sampling with the multinomial probit 
likelihood requires drawing from the conic truncation of a c-dimensional normal distribution 
for each observation, and the cost of one draw is denoted by ng. The QIEP solution can 
be implemented efficiently because same function evaluations can be utilized in all of the 
2c — 1 one-dimensional quadratures and the number of two-dimensional quadratures does 
not depend on c. The cubic scaling in c — 1 of the tilted moment evaluations in the nested 
EP and lEP algorithms can be alleviated by reducing the number of inner-loop iterations 
TT/jj^ as discussed in Section 3.2.3. 

Using the USPS 3 vs. 5 vs. 7 dataset, we measured the CPU time required for the 
posterior inference on f given nine different preselected hyperparameter values from the 
grid of Figure El With our implementations, LA was the fastest, and EP and VB were 
about three times more expensive than LA. Because of the efficient scaling (Table [1]), VB 
should be much faster, and probably closer to the running time of LA. One reason for the 
slow performance may be our implementation based on importance sampling steps, which 
may result in slower convergence due to ffuctuations. The MCMC and LA-TKP approaches 
were overall very slow compared to LA. One iteration of MCMC is relatively cheap, but in 
our experiments thousands of posterior samples were required to obtain chains of sufficiently 
uncorrelated samples which is why MCMC was over hundred times slower than LA. LA- 
TKP requires roughly c + 1 times the CPU time of LA for computing the predictions for 
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each test input. Therefore, the computational cost of LA-TKP becomes quickly prohibitive 
as the number of test input increases. 

In the CPU time comparisons across the range of hyperparameter values producing 
variety of skewed and non-isotropic posterior distributions, fully coupled nested EP con- 
verged in fewer outer-loop iterations than nested lEP if the same convergence criteria were 
used. Figure [7] illustrates the difference with the hyperparameters fixed at log((T^) = 8 
and log(^) = 2.5 which results in good predictive performances on the independent test 
dataset with both methods (see Figure [6]). Figure [7] shows the log marginal likelihood ap- 
proximation — log Zep and the mean log predictive density (mlpd) in the test dataset after 
each iteration for both approximations. Note that the converged EP approximation satis- 
fying the moment matching conditions between p(f,;) and qff?;) corresponds to stationary 
point s of an objective function similar to — log Zep ( Minkal . 2001b : Opper and Winthei . 



20051 ) . The solid gray lines correspond to EP and lEP approximations obtained with the 



proposed nested EP algorithm, and for comparison, the dashed black lines show also the 
quadrature-based QIEP solution. 

The convergence is illustrated both with a small amount of damping, damping factor 
6 = 0.8 (the first and second columns Figure 7), and with a larger amount of damping 5 = 0.5 
(the last column). Note that with the fully coupled nested EP algorithm the damping is 
applied on the inner-EP site parameters Qj and (3i, whereas with lEP and QIEP the damping 
is applied on the natural exponential site parameters u and Tj = (Tj diagonal). In the 
first column (Standard) the inner-loops of the nested EP (and lEP) algorithms are run until 
convergence at each outer-loop iteration, whereas in the remaining columns (Incremental) 
only one inner-loop iteration per site is done at each outer-loop iteration. Based on Table 
[T]the incremental updates (nin = 1) reduce notably the computational burden of the inner- 
loop of the nested EP algorithm which scales as 0(nin(c — 1)'^). 

Figure 7 shows that with standard updates the nested lEP gives almost identical ap- 
proximation compared to the QIEP. However, when incremental updates are used, some 
differences can be seen in the early iterations but both methods converge into the same solu- 
tion. The full nested EP algorithm oscillates less than lEP or QIEP with the same damping 
level but this may be partly caused by the different parameterization. However, with both 
damping levels full EP seems to converge in fewer iterations whereas there is a slow drift 
in — log Zep and the mlpd score even after 20 iterations with nested lEP and QIEP. One 
explanation for this behavior can be the large magnitude hyperparameter value which re- 
sults into strongly non-Gaussian posterior distribution and the relatively large lengthscale 
which induces stronger between-class posterior dependencies through the likelihood terms. 



5.4 Predictive performance across datasets with hyperparameter estimation 

In this section, we compare the predictive performances of nested EP, nested lEP, VB, 
LA, LA-TKP, and Gibbs sampling with the multinomial probit (MCMC) with estimation 
of hyperparameters on various benchmark datasets. All methods are compared using the 
USPS 3 vs. 5 vs. 7 data, and the following five UCI Machine Learning Repository datasets: 
New-thyroid, Teaching, Glass, Wine, and Image segmentation. The comparisons are also 
done using the USPS 10-class dataset, but only for EP, lEP, VB, and LA due to the large 
n. The main characteristics of the datasets are summarized in Table [2j 
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Figure 7: A convergence comparison between EP and lEP using parallel updates in the 
outer EP loop with the USPS 3 vs. 5 vs. 7 data. Plots (a)-(f) show the negative 
log marginal likelihood estimates — log Zep as a function of iterations for two 
different damping factors S, and plots (g)-(l) show the corresponding mean log 
predictive density evaluated using a separate test dataset. The hyperparameters 
of the squared exponential covariance function were fixed at log((T^) = 8 and 
log(/) = 2.5. In the first column (Standard) the inner-loops of the nested EP (and 
lEP) algorithms (the solid gray lines) are run until convergence at each outer- 
loop iteration, whereas in the other columns (Incremental) only one inner-loop 
iteration per site is done at each outer-loop iteration. For comparison, the dashed 
black lines show also the lEP solution obtained with numerical quadratures. 
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Table 2: Datasets used in the experiments. 



Dataset 


Strain 




Classes (c) 


Covariates (d) 


ARD 


New-thyroid 


215 


215 (Ten-fold CV) 


3 


5 


yes 


Teaching 


151 


151 (Ten-fold CV) 


3 


5 


yes 


Glass 


214 


214 (Ten-fold CV) 


6 


9 


yes 


Wine 


178 


178 (Ten-fold CV) 


3 


13 


yes 


Image segmentation 


210 


2100 


7 


18 


no 


USPS 3 vs. 5 vs. 7 


1157 


1175 


3 


256 


no 


USPS 10-class 


4649 


4649 


10 


256 


no 



For all the datasets, we standardize the covariates to zero mean and unit variance, 
and use the squared exponential covariance function with the same hyperparameters for all 
classes. Only one lengthscale parameter is assumed for the Image segmentation and USPS 
datasets, but otherwise individual lengthscale parameter (Automatic Relevance Determi- 
nation, ARD) is set for each input dimension. We place a weakly informative prior on the 
lengthscale and magnitude parameters by choosing a half Student-t distribution with four 
degrees of freedom and a variance equal to one hundred. With MCMC we sample the hyper- 
parameters, and with the other methods, we use gradient-based type-II MAP estimation to 
select the hyperparameter values. The predictive performance is measured using a ten-fold 
cross-validation (CV) with four of the datasets, and predetermined training and test sets 
for three of the datasets (See Table [2]). 

The first and third rows of Figure [8] visualize the mean log predictive predictive densities 
(MLPD) and their 95% credible intervals for six data, s ets es timated using the Bayesian 
bootstrap method as described by Vehtari and Lampinen ( 20021 ). To highlight the differences 
between the methods more clearly, we compute pairwise differences of the log posterior 
predictive densities with respect to EP. The second and fourth rows of Figure [8] show the 
mean values and the 95% credible intervals of the pairwise differences. The comparisons 
reveal that EP performs well when compared to MCMC; only in the Teaching and Image 
segmentation datasets MCMC is significantly better. lEP performs worse than EP in all 
datasets except Teaching and Glass. The predictive densities of VB and LA are overall 
worse than EP, lEP or MCMC. LA-TKP improves the performance of LA with all datasets 
except Teaching. 

The first and third rows of Figure [9] shows the mean classification accuracies and their 
95% credible intervals, and the second and fourth rows show the pairwise mean differences 
of the classification accuracies together with the 95% credible intervals with respect to EP. 
Because the difference of the classification outcomes for each observation is a discrete vari- 
able with three possible values (worse, same, or better than EP), we assume a multinomial 
model with a non- informative Dirichlet prior distribution for the comparison test. In a case 
where the method has exactly the same predictions as EP, a small circle is plotted around 
the mean value. The differences between all methods are small. In the Teaching dataset, 
where the overall accuracy is the lowest, the MCMC estimate is significantly better than 
any other method. There is no statistically significant difference between EP and lEP; 
lEP performs slightly better in the Wine dataset, but EP has better accuracy in the Glass 
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Figure 8: The first and third rows: The mean log predictive densities and their 95% credible 
intervals for six datasets (See Table ^ using EP, lEP, VB, LA, LA-TKP, and 
MCMC with Gibbs sampling. The second and fourth rows: Pairwise differences 
of the log predictive densities with respect to EP (mean -|- 95% credible intervals). 
Values above zero indicate that a method is performing better than EP. 



and Image segmentation datasets, which both have more than three target classes, and in 
which the overall classification accuracies are among the lowest. LA has good classification 
accuracy, and performs better than EP in Image segmentation. A possible explanation for 
this is the different shape of the softmax likelihood function used by LA. If classification 
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Figure 9: The first and third rows: The classification accuracies and their 95 % credible 
intervals for six datasets (See Table [2]) using EP, lEP, VB, LA, LA-TKP, and 
MCMC with Gibbs sampling. The second and fourth rows: Pairwise differences 
of the classification accuracies with respect to EP (mean + 95% credible intervals). 
Values above zero indicate that a method is performing better than EP. A small 
circle at the mean value is plotted if the predictions are exactly the same as with 
EP. 



accuracy is the only criterion, the LA-TKP correction seems unnecessary. VB has the 
lowest classification performance and is significantly worse than the other methods in the 
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Figure 10: The mean log predictive densities (a) and classification accuracies (c) for the 
USPS 10-class dataset (See Table ED using EP, lEP, VB, and LA. The pairwise 
differences of the log predictive densities and the classification accuracies with 
respect to EP are shown in (b) and (d), respectively. In panels (b) and (d) values 
above zero indicate that a method is performing better than EP. In all panels 
mean and 95% credible intervals are shown. 



Image segmentation and USPS 3 vs. 5 vs. 7 datasets, which is probably caused by a worse 
estimate of the hyperparameter values. 

Finally, we summarize the MLPD scores and classification accuracies of EP, lEP, VB, 
and LA with the USPS 10-class dataset in Figure [TOl Both EP approaches are significantly 
better than VB or LA with both measures. Considering the EP approaches, EP achieves 
slightly better MLPD score, whereas lEP is slightly better in terms of classification accuracy, 
but the differences are not statistically significant. 



6. Conclusions and further research 

EP approac hes for GP cla s sifica tion with the multinomial p robit model have already been 
proposed by Seeger et al. ( 20061 ) and Girolami and Zhong ( 2007 ). In this paper, we have 



complemented their work with a novel quadrature-free nested EP algorithm that maintains 
all between-class posterior dependencies but still scales linearly in the number of classes. 
Our comparisons show that when the hyperparameters are determined by optimizing the 
marginal likelihood, nested EP is a consistent approximate method compared to full MCMC. 
In terms of predictive density, nested EP is close to MCMC, and more accurate compared 
to VB and LA, but if only the classification accuracy is concerned, all the approximations 
perform similarly. LA-TKP improves the predictive density estimates of LA but the com- 
putational cost becomes increasingly demanding if larger number of predictions are needed. 

In our comparisons the predictive accuracies of the full EP and lEP solutions obtained 
using the nested EP algorithm are similar for practical purposes. However, our visualizations 
show that the approximate marginal posterior distributions of the latent values provided by 
full EP are clearly more accurate, although the full nested EP solution can be calculated with 
similar computational burden than nested lEP. Because there is no convergence guarantee 
for the standard EP algorithm, it is worth to notice the differences in the convergence 
properties of full EP and lEP observed in our experiments. With the same hyperparameter 
values, nested lEP converged more slowly and required more damping than full nested 
EP. This can be due to slower propagation of information caused by the independence 



26 



Nested EP for GP classification with Multinomial Probit 



assumptions, and this behavior can get worse as the between-class posterior couphngs get 
stronger with certain hyperparameter values. Given all these observations, we prefer full 
EP over lEP. 

Models in which each likelihood term related to a certain observation depends on multi- 
ple latent values, such as the multinomial probit, are challenging for EP because a straight- 
forward quadrature-based implementation may become computationally infeasible unless 
independence assumptions between the latent values or some other simplifications are made. 
In the presented nested EP approach, we have applied inner EP approximations for each 
likelihood term within an outer EP framework in a computationally efficient manner. This 
approach could be applicable also for other similar multi-latent models which admit inte- 
gral representations consisting of simple factorized functions each depending on a linear 
transformation of the latent variables. 

A drawback with GP classifiers is the fundamental computational scaling O(n^) resulting 
from the prior structure. To speed up the inference in multiclass GP classification, sparse 
approximations such as the in formative vector ma chine flVM) have been proposed f Seeger 
and Jordan, 2004: Girolami and Rogers, 2006; Scc ger et al.l . l200fih ." IVM uses the information 
provided by all observations to form an active subset which is then used to form the posterior 
mean and covariance approximations. The presented EP approa ch could be extended to 
IVM in a similar fashion as described bv lSeeger and Jordan! (j2004l ). The accurate marginal 
approximations of full EP could be useful in determining the relative entropy measures used 
as a scoring criterion to select the active set. To speed up the computations, the inner EP 
site parameters could be updated iteratively even for the observations not in the active set 
in a similar fashion as described in Section 13.21 Recently, a simi lar approach to IVM called 
predictive active set selection (PASS-GP) has been proposed bv iHenao and Wintherl (j2010l ) 
to lower the computational complexity in binary GP classification. PASS-GP uses the 
approximate cavity and cavity predictive distributions of EP to determine a representative 
active set. The proposed EP approach could prove useful when extending PASS-GP to 
multiple classes, because it provides accurate marginal predictive density estimates. 



27 



RlIHIMAKI, JyLANKI AND VEHTARI 



Appendix A. Approximating tilted moments using EP 

For convenience, we summarize the inner EP algorithm for approximating the tilted mo- 
ments resulti ng from a mu ltinomial probit likelihood. Essentially the same algorithm was 



pres ented bvlMinkal (l2001ar) for classification with the Baves point machine and later by Qi 



et al. ( 20041 ) for the binary probit classifier. To facilitate a computationally efficient imple- 



mentation, the following algorithm description is written with an emphasis to reduce the 
number of ve ctor and matrix op e ration s in a similar fashion as in the general EP formulation 
presented by Cseke and Heske^ ( 2011 . appendix C). 



We want to approximate the normalization, mean and covariance of the tilted distribu- 
tion 

c 

p(w,) = ^r'-^^(w^|/Xw.,SwJ n ^(wfb,,,). (28) 
This is done with the EP algorithm which results in the Gaussian approximation 

c 

g(w,) = z,::W(w,|/^„^,s^j n (29) 

where we have used the natural parameters aij (precision) and (location) for the site 
approximations. The index i denotes the z'th observation, and to clarify the notation below, 
we leave out this index from the inner EP terms. In the first outer- loop, the site parameters 
a and /3 are initialized to zero, fXg. to fi^^, and Eg. to Swi- After the first outer-loop, these 
parameters are initialized to their last values from the previous iteration for speed-up. The 
following steps are repeated until convergence. 

1. Cavity evaluations: 

v., = (vf-aj)-' (30) 
m_j = v-j{v~^mj - Pj), (31) 

where scalars Vj = b^^S^-bjj- and rrij = hjjfx^. corresponds to marginal distribution 
of latent wfhij. 



2. Tilted moments: 



Zj = <^iz,) (32) 

rhj = PjV-j + ma-j (33) 

= v-j-vljjj, (34) 

where Zj = m^j{l + v^j)-^/^ , pj = ^^{l + v^j)-^/^ and 7^- = + ZjPjil + v^j)-^^ . 

3. Site updates with damping: 



Aaj = S{v-^-v-^) (35) 
APj = 5{vJ^mj -v-^nij), (36) 

where 6 € (0, 1] is the damping factor. 



28 



Nested EP for GP classification with Multinomial Probit 



4. Rank-1 covariance update: 

= i:^^-^j{l + AajVj)-^Adj^J (37) 
= /i^,+i9,-(l + Aa,-t;,)-i(A/3, -Ad,m,-), (38) 

where i?i = Sg.bjj. 

Alternatively, the rank-1 updates of step 4 could be replaced by only one parallel covariance 
update after each sweep over the sites indexed by j. 

Appendix B. Details of posterior computations 

The site covariance isT = D — DR{R^DR)^^R^D, where the matrix R is a cnxn matrix of 
c times stacked identity matrices /„, and D = diag [vrj, . . . , vr^, vr^, . . . , tt^, . . . , vrj, . . . , tt^] . 
To compute predictions at a test point , we need to first evaluate the mean and covariance 
off*= [/i,/2,...,/^]^as 

E[f,] = K^{I - MK)u (39) 
Cov[f,] = i^,,, - K^MK^, (40) 

where M = f{I + KT)-'^, and is the c X cn covariance matrix between the test point 
and the training points, and -fC*^* is the c x c covariance matrix for the test point. The 
matrix M in the equations ([39]) and (HOl) can be evaluated using 

M = B- BRP-^R^B, (41) 

where B = D'^/^A^^D^/^, P = R^ D^/^A-^D^/'^R, and A = I+D^/^KD^/^. To evaluate M, 
we compute the Cholesky decompositions of P and the c diagonal blocks of A, which results 
in the scaling 0{{c + l)n^). The predictive mean and covariance can be computed by using 
the block diagonal structure of B and the sparse structure K^. Given E[f*] and Cov[f=i,], 
the integration over the posterior uncertainty of f required to compute the predictive class 
probabilities, is equivalent to the tilted moment evaluation, and can be approximated as 
described in appendix A. 

The marginal likelihood approximation of EP can be computed as 

logZEP = ^i>V-^log|/ + ifT| + ^logZ^^ + i^(/x^,S:i/i_, + log|S_i|) 

i i 

-^^(/xfS-V. + log|E,|), (42) 

i 

where the vector /Xj of length c and the c x c matrix are the i'th marginal mean and 
covariance. Similarly, and S_j are the cavity mean and covariance. The moments Zi 
in (I42p are the normalization terms from the algorithm described in appendix A. Finally, 
the determinant term in (jl2|) can be evaluated as 

\I + Kf\ = \A\\R^DR\-^\P\. (43) 
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The gradients of the log marginal likelihood with respect to 6 can be obtained by cal- 
culating only the explicit derivatives of the first two terms of (j42p . The implicit derivatives 
with respect to the site parameters and cavity parameters (in their natural exponential 
forms) of the outer EP cancel each other out in the convergence (jOpper and Wintheri . 12005 ; 



Seegerl . l2005l ). Since the likelihood does not depend on any hyperparameters, the explicit 
derivatives of log Z^. are zero. Also the implicit derivatives of log Z^. with respect to the 
inner EP parameters cancel out because these terms are formed as marginal likelihood ap- 
proximations with the inner EP, which has the same previously mentioned property of the 
EP algorithm. 
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